Transpile INTERSECTS to binned equi-join in pure SQL for full-table joins — Closes #78#79
Draft
conradbzura wants to merge 20 commits intomainfrom
Draft
Transpile INTERSECTS to binned equi-join in pure SQL for full-table joins — Closes #78#79conradbzura wants to merge 20 commits intomainfrom
conradbzura wants to merge 20 commits intomainfrom
Conversation
Column-to-column INTERSECTS joins (e.g., a.interval INTERSECTS b.interval) are now rewritten into binned equi-joins using CTEs with UNNEST(range(...)) bin assignments. This gives the query planner an equi-join key to work with instead of forcing a nested-loop or cross join. The bin size defaults to 10,000 and is configurable via the new bin_size parameter on transpile(). Literal-range INTERSECTS filters remain unchanged.
Needed for end-to-end correctness tests that validate the binned equi-join SQL against DataFusion's query engine.
The transformer now detects column-to-column INTERSECTS in WHERE clauses (FROM a, b WHERE a.interval INTERSECTS b.interval), not just in explicit JOIN ON conditions. Both patterns are rewritten to binned equi-joins for the same performance benefit.
Covers both explicit JOIN ON and implicit cross-join patterns, custom bin sizes, custom column mappings, self-joins, literal range passthrough, and end-to-end correctness against DataFusion including multi-bin deduplication and equivalence with naive joins.
Move the overlap predicate (start < end AND end > start) from WHERE into the JOIN ON clause so that LEFT/RIGHT/FULL JOIN semantics are preserved — a WHERE filter on the right-side columns silently converts outer joins into inner joins. Also refactor the transformer to rewrite all INTERSECTS joins in a query, not just the first. A new _ensure_table_binned helper tracks which aliases already have binned CTEs so that multi-join queries reuse CTEs instead of duplicating them. Add bin_size validation (must be positive) and remove dead code from _rewrite_where.
Cover three-way joins with CTE reuse, invalid bin_size rejection, and update assertions for the overlap-in-ON change. Remove unused pytest import from module level.
The binned CTE approach leaks __giql_bin into SELECT * results because CTEs expose all their columns. Revert implicit cross-join rewriting (FROM a, b WHERE INTERSECTS) so those queries fall through to the generator's naive overlap predicate, which produces clean column output. Explicit JOIN ON INTERSECTS continues to use the binned equi-join. Also add pytest.importorskip for datafusion so the DataFusion correctness tests are skipped when the module is not installed.
The CI workflow uses pixi, not uv, so the datafusion package must be listed under [tool.pixi.dependencies] for the DataFusion correctness tests to run. Remove the pytest.importorskip guard since the dependency is now always available.
The previous approach replaced FROM/JOIN table references with full
CTEs (SELECT *), causing __giql_bin to appear in SELECT a.* output.
The new approach keeps original table references and routes the equi-
join through key-only bridge CTEs (SELECT chrom, start, end, bin),
eliminating the leak entirely.
This also restores implicit cross-join rewriting (FROM a, b WHERE
INTERSECTS) which was reverted in the prior commit due to the leak.
CTEs are now named __giql_{table}_bins and deduplicated per underlying
table name rather than per alias, so self-joins share one CTE.
Queries with explicit column lists (SELECT a.chrom, b.start, ...) cannot expose __giql_bin in their output regardless of which CTE the table alias points to. Detecting this at transform time lets us skip the 3-join bridge pattern entirely for those queries and use the simpler, faster 1-join full-CTE approach. Queries with wildcards (SELECT a.*, SELECT *) still take the bridge path so __giql_bin never leaks into the output column set.
Drop section divider lines (`# --...--`) from `IntersectsBinnedJoinTransformer` to reduce visual clutter. Descriptive inline comments explaining code behavior are preserved.
Cover outer join semantics (LEFT/RIGHT/FULL preserved through both full-CTE and bridge paths), additional ON conditions surviving the rewrite alongside INTERSECTS, and unconditional DISTINCT collapsing legitimate duplicate rows. The DISTINCT tests are marked xfail since the correct behavior (preserving duplicates) is a known limitation. 7 tests fail against the current implementation, confirming the bugs. 2 tests are strict xfail documenting the DISTINCT limitation.
Two interrelated fixes for the binned equi-join rewrite: The bridge path was silently converting LEFT/RIGHT/FULL joins to INNER because sqlglot stores the join type as "side" not "kind", and only join3 received it. Propagate the side attribute to both join2 and join3. FULL OUTER with wildcards falls back to the full-CTE path because the three-join chain's bin fan-out creates spurious unmatched rows that DISTINCT cannot resolve. Both rewrite paths were replacing the entire ON clause with the binned equi-join and overlap predicate, silently dropping any user-supplied conditions alongside INTERSECTS. Extract non- INTERSECTS conditions from the original ON and AND them back into the rewritten clause.
DISTINCT is added unconditionally to column-to-column INTERSECTS joins to eliminate duplicates from the bin fan-out. This section explains the mechanism, the edge case where it can collapse genuinely identical source rows, and the mitigation of including any distinguishing column in the SELECT list.
Move DEFAULT_BIN_SIZE to constants module and export from __init__. Extract shared _build_bin_range helper to eliminate duplicate bin-computation logic between the two CTE builders. Replace the mutable-list connector counter with itertools.count. Add isinstance check for bin_size so floats are rejected early. Rewrite _remove_intersects_from_where to use _extract_non_intersects so deeply-nested AND trees are handled cleanly. Expand docstrings on the class, __init__, _find_column_intersects_in, and _build_join_back_joins to document assumptions and limitations.
Use hypothesis to generate random intervals spanning multiple bins and verify that the binned equi-join produces identical results to bedtools intersect -u. Three tests cover two-table joins, self-joins, and varying bin sizes (100 to 100k). Intervals use unique names to avoid the known DISTINCT duplicate-collapse limitation.
Two correctness fixes for the binned equi-join rewrite: 1. Bin index rounding: CAST(start / B AS BIGINT) uses float division, so values like 621950/100 = 6219.5 round to 6220 instead of flooring to 6219. Replace Div+Cast with IntDiv (//) which does proper integer floor division on all engines. 2. Outer join spurious NULLs: when an interval spans multiple bins, the LEFT/RIGHT/FULL outer join produces one row per bin. Bins that don't match the other side create NULL rows even though the same source row matches via a different bin. DISTINCT can't collapse these because NULL and non-NULL rows differ. Add a pairs-CTE approach that computes matching (left_key, right_key) pairs via an INNER binned join with DISTINCT, then outer-joins the original tables through this pairs CTE. This matches the pattern used by Databricks and Snowflake, which restrict binning to INNER joins and use separate logic for outer join semantics.
Add regression tests for both bugs found by property-based testing: - TestBinnedJoinBinBoundaryRounding: verifies that overlaps at .5 division boundaries are not dropped by float rounding (DuckDB). - TestBinnedJoinOuterJoinMultiBin: verifies that LEFT, RIGHT, and FULL OUTER joins with multi-bin intervals produce no spurious NULL rows (DataFusion). Add property-based bedtools correctness tests: - test_multi_table_join_matches_bedtools: three-way INTERSECTS join compared against chained bedtools intersect. - test_left_join_matches_bedtools_loj: LEFT JOIN INTERSECTS compared against bedtools intersect -loj. Add -loj support to the bedtools wrapper for left outer join output.
Extend the bedtools wrapper to support -v, -wa -wb, -c, -wo, -wao, -f, -F, and -r flags. Add property-based tests that compare GIQL queries against each bedtools intersect mode: inverse (-v via LEFT JOIN anti-join), write-both (-wa -wb via full pair SELECT), count (-c via GROUP BY COUNT), same-strand (-s), opposite-strand (-S), minimum overlap fraction of A (-f), minimum overlap fraction of B (-F), and reciprocal fraction (-f -r). Total: 520 randomized examples across 13 property-based tests covering all bedtools intersect overlap flags.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Add
IntersectsBinnedJoinTransformerto rewrite column-to-columnINTERSECTSjoins into binned equi-joins usingUNNEST(range(...))CTEs. The generated SQL is portable across DuckDB, DataFusion, and PostgreSQL — no runtime extensions required. Three rewrite strategies are selected automatically: a full-CTE path for non-wildcard SELECTs, a key-only bridge CTE pattern for wildcard SELECTs, and a pairs-CTE path for outer joins.SELECT DISTINCTis added unconditionally to deduplicate rows from multi-bin matches.Outer joins (LEFT, RIGHT, FULL) use a dedicated pairs-CTE approach that computes matching key pairs via an INNER binned join, then outer-joins the original tables through this pairs CTE. This avoids the bin fan-out that creates spurious NULL rows when an interval spans multiple bins but only matches in some of them — matching how Databricks and Snowflake handle the problem (they restrict binning to INNER joins entirely).
Bin indices use integer floor division (
//) instead of float division + CAST to avoid rounding errors at bin boundaries.Closes #78
Proposed changes
Binned equi-join transformer
Add
IntersectsBinnedJoinTransformerinsrc/giql/transformer.py. Each interval is assigned to bins viaUNNEST(range(start // B, (end - 1) // B + 1)). The transformer handles explicitJOIN ON, implicit cross-join (FROM a, b WHERE ...), self-joins, multi-table joins, and custom column mappings. Bin size defaults toDEFAULT_BIN_SIZE(10,000) and is configurable via thebin_sizeparameter ontranspile().Three rewrite strategies
SELECT *, __giql_binCTEs and rewrite the JOIN ON. Used when the SELECT list has no wildcards and no outer joins are present.SELECT chrom, start, end, __giql_binCTEs and a three-join chain that keeps original table references intact, preventing__giql_binfrom appearing ina.*expansion. Used for wildcard SELECTs with INNER joins.(left_key, right_key)pairs via an INNER binned join with DISTINCT, then outer-join the original tables through the pairs CTE. Used for any query containing outer joins with INTERSECTS predicates. This avoids the bin fan-out problem where multi-bin intervals create spurious NULL rows.Integer floor division for bin indices
Replace
CAST(start / B AS BIGINT)withstart // B. Float division followed by CAST rounds to nearest on engines like DuckDB (e.g.,CAST(621950 / 100 AS BIGINT)yields 6220 instead of 6219), causing missed matches at bin boundaries.Outer join and extra ON condition handling
Propagate the join side (
LEFT,RIGHT,FULL) to both replacement joins in the pairs-CTE path. Extract non-INTERSECTS siblings from AND trees in ON clauses via_extract_non_intersects()and re-attach them to the rewritten join.Code quality improvements
Move
DEFAULT_BIN_SIZEtoconstants.pyand export fromgiql.__init__. Extract shared_build_bin_range()helper to eliminate duplicate bin-computation logic. Replace mutable-list connector counter withitertools.count. Addisinstancecheck forbin_sizeto reject floats early. Rewrite_remove_intersects_from_whereto handle deeply-nested AND trees cleanly.Documentation
Document the DISTINCT deduplication behavior in
docs/dialect/spatial-operators.rstunder a new "Deduplication Behavior" subsection of INTERSECTS, explaining the mechanism, the edge case where genuinely identical rows are collapsed, and the mitigation of including a distinguishing column.Test cases
TestTranspileBinnedJoinTestTranspileBinnedJoinTestTranspileBinnedJoinTestTranspileBinnedJoinTestTranspileBinnedJoinTestTranspileBinnedJoinTestTranspileBinnedJoinbin_size=NoneTestTranspileBinnedJoinTestTranspileBinnedJoinTestTranspileBinnedJoinbin_size=0or negativeTestTranspileBinnedJoinTestTranspileBinnedJoinTestTranspileBinnedJoinTestBinnedJoinDataFusionTestBinnedJoinDataFusionTestBinnedJoinDataFusionTestBinnedJoinDataFusionTestBinnedJoinDataFusionTestBinnedJoinDataFusionTestBinnedJoinDataFusion__giql_binin outputTestBinnedJoinOuterJoinSemanticsTestBinnedJoinOuterJoinSemanticsTestBinnedJoinOuterJoinSemanticsTestBinnedJoinOuterJoinSemanticsTestBinnedJoinOuterJoinSemanticsTestBinnedJoinOuterJoinSemanticsTestBinnedJoinOuterJoinSemanticsTestBinnedJoinAdditionalOnConditionsTestBinnedJoinAdditionalOnConditionsTestBinnedJoinAdditionalOnConditionsTestBinnedJoinAdditionalOnConditionsTestBinnedJoinAdditionalOnConditionsTestBinnedJoinDistinctSemanticsTestBinnedJoinDistinctSemanticsTestBinnedJoinDistinctSemanticsTestBinnedJoinDistinctSemanticsTestBinnedJoinBinBoundaryRoundingTestBinnedJoinBinBoundaryRoundingTestBinnedJoinOuterJoinMultiBinTestBinnedJoinOuterJoinMultiBinTestBinnedJoinOuterJoinMultiBinTestBinnedJoinOuterJoinMultiBintest_intersect_propertytest_intersect_propertytest_intersect_propertytest_intersect_propertytest_intersect_property